Range Indexing

Vitalware 2.0 saw the addition of several indexing methods to the database engine. In particular support for NULL indexing (whether a field is empty or not) and PARTIAL indexing (fast searching for leading characters, e.g. a*) was added. Tools were provided that allowed System Administrators to configure, via the Vitalware Registry, which fields required the new indexing methods. The new indexing facilities did not provide a mechanism for adjusting or tuning range indexes.

Vitalware 2.1.01 saw the introduction of additional tools that permit System Administrators to tune the range indexing used by Vitalware. Support for automatic optimization of range indexes has also been added. Using these tools Vitalware can now provide optimal range indexes with significantly faster range based searches.

Note: Included with the following description of Vitalware Range Indexing, you will find details about Range Indexing Registry entries. If you are specifically looking for details of Range Indexing Registry entries, see Range Buckets Registry entry.

The range indexing employed by the Vitalware database server is designed to work with the Two Level Superimposed Coding Scheme for Partial Match Retrieval system used for all searching. As it is an extension of the standard indexing Vitalware still requires only one set of index files, thus avoiding the need for costly "join" based queries.

Range indexing is really a series of "mini" indexes on a per field basis. Unlike the Two Level scheme, where all search terms are placed in the one index, range terms are placed in per field indexes that are then concatenated to form one range index. Each field index consists of a number of range buckets. These buckets are used to indicate whether a given value falls within the bucket or not.

There are two related considerations when establishing range buckets:

  1. Data distribution: as best as possible data should be distributed evenly between the buckets.
  2. Logical query ranges. The aim is to minimize the necessity to check a bucket for a value as checking whether records in a bucket match the query takes time; therefore if users are likely to search on particular ranges (decades for instance: 1/1/1910 to 1/1/1920) it makes sense to configure range buckets appropriately.

An example may make things clearer. Imagine that for statistical purposes we're recording the age of women when they give birth, anywhere between about 15 and 45. Let's say we establish the following range buckets:

Index

Note:

  1. -infinity and +infinity will capture any values outside the specified ranges.

When a range search is performed these buckets are used to determine possible matches. Consider the following searches for ages from:

  1. 15 to 35:

    Index

    In this case:

    1. Three buckets can be safely ignored.
    2. One bucket does not need to be checked and is definitely included in the search.
    3. Two buckets need to be checked for a match.
  2. 15 to 30:

    Index

    In this case:

    1. Four buckets can be safely ignored.
    2. One bucket does not need to be checked and is definitely included in the search.
    3. One bucket needs to be checked.
  3. 10 to 35:

    Index

    In this case:

    1. Three buckets can be safely ignored.
    2. Two buckets do not need to be checked and are definitely included in the search.
    3. One bucket needs to be checked.
  4. 10 to 30:

    Index

    In this case:

    1. Four buckets can be safely ignored.
    2. Two bucket do not need to be checked and are included in the search.
    3. No buckets need to be checked.

    In this case, no buckets need to be checked!

It should be clear from the last example that determining range buckets that match likely user searches has a direct influence on the speed of range based queries.

The problem with setting the range bucket values is that the bucket values depend on two variables. The first is the range of values entered into a field, in other words the data distribution. If the Width field contains a wide range of values, it makes sense to have a wide range of buckets. If, however, the data is centered around one point, it may make sense to have a series of range buckets cover this period.

The second variable is the query ranges used to retrieve data. The problem with this variable is that for a given field it may be very hard to determine what sort of query ranges will be used without performing extensive analysis.

As these two variables cannot be known before data has been loaded Vitalware provides a default set of range buckets for each range searchable field. In general these buckets are satisfactory for a small numbers of records, however significant reductions in search times may be achieved by "tuning" the range buckets for large numbers of records.

Tuning the range indexes involves considering the two variables, data distribution and query ranges, and for each variable determining the set of range buckets that provides optimal performance. The objective in fine-tuning range indexing is to:

  1. Minimize the number of buckets that need to be searched.
  2. Minimize the number of records in buckets that need to be searched. Note that this objective will take precedence over the default distribution of records evenly between buckets.

Vitalware 2.1.01 provides a mechanism that allows System Administrators to have the range buckets tuned automatically based on the data distribution within a field. The facility does not take query ranges into account as this information requires subjective interpretation to determine which queries are important to have optimized and which queries can run a little slower.

The new facility does however provide data distribution information that may be used by a System Administrator to set the range buckets manually to achieve optimal performance where query ranges are known and can be weighted.

Hence the new facility provides automated range buckets, but allows System Administrators to override these settings and configure their own buckets manually.